16S Data Processing Training

Author

Katie McCauley from the Bioinformatics and Computational Biosciences Branch (BCBB)

Thank you for joining us for this 16S rRNA microbiome data processing tutorial! Before we jump in, let’s start by going over our learning objectives for the technical component.

This tutorial is designed to introduce you to the steps involved in generating data with DADA2. While you may never actually create an ASV table using R code again (perhaps preferring Nephele’s code-free DADA2 pipeline[1]), we will walk through each step and highlight points to watch out for or consider when choosing parameters or evaluating your results.

1 Quick Note about ASVs vs OTUs

What we’ll cover today are called Amplicon Sequence Variants (ASVs), or simply Sequence Variants, because they can differ by as little as one nucleotide (we’ll cover how this happens later). Before DADA2 was widely accepted, though, the field primarily used Operational Taxonomic Units, or OTUs, which were built by clustering sequences at 97% similarity. Meaning:

  • Take your most frequent sequence and consider it the first “seed”

  • Take your next most-frequent sequence and determine how similar it is to the first sequence. If it’s at least 97% similar, consider it the “same” as the first sequence; if it’s less similar than that, consider it a new “seed”

  • Repeat for all dereplicated sequences, checking against all other “seed” sequences
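
The greedy scheme above can be sketched in a few lines of base R. This is a toy illustration only, not a production clusterer: it assumes dereplicated, equal-length sequences already sorted by decreasing abundance, and scores similarity as simple per-base identity.

```r
# Per-base identity between two equal-length sequences (toy metric)
pct_identity <- function(a, b) {
  mean(strsplit(a, "")[[1]] == strsplit(b, "")[[1]])
}

# Greedy OTU assignment: each sequence joins its closest existing seed
# if similarity >= cutoff, otherwise it becomes a new seed.
greedy_otu <- function(seqs, cutoff = 0.97) {
  seeds <- character(0)
  assignment <- integer(length(seqs))
  for (i in seq_along(seqs)) {
    sims <- vapply(seeds, pct_identity, numeric(1), b = seqs[i])
    if (length(seeds) > 0 && max(sims) >= cutoff) {
      assignment[i] <- which.max(sims)   # join the closest existing seed
    } else {
      seeds <- c(seeds, seqs[i])         # start a new "seed"
      assignment[i] <- length(seeds)
    }
  }
  assignment
}
```

Notice how order matters: whichever sequence is most abundant "wins" the seed position, which is part of why this approach is described as greedy.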

The 97% cutoff was chosen because it was presumed to account for potential PCR error in the sequence data, but in practice the clustering is often fairly greedy. The figure below, from Benjamin Callahan’s supplemental material for the DADA2 paper[2], helps illustrate this comparison:

Comparing OTUs to ASVs

2 A Primer on Sequencing

A short video from Illumina on “Sequencing by Synthesis”

Indexed Sequencing Overview

When the data is pulled from the sequencer, it typically arrives in BCL format, which is a highly compressed and proprietary Illumina format from which we need to create human- and computer-readable files (FASTQ). Illumina now frequently provides individual forward and reverse sequence files for each sample (called “Demultiplexed”), and this is the format that gets uploaded to public repositories like the Sequence Read Archive (SRA) or the European Nucleotide Archive (ENA).

Library Multiplexing Overview

This is what a “FASTQ” file ends up looking like:

@M03213:59:000000000-AWR6D:1:1101:12406:1145 2:N:0:NCCTGAGC+NTATTAAG
NGACTACTGGGGTTTCTAATCCTGTTTGCTCCCCACGCTTTCGCACATGAGCGTCAGTACATTCCCAAGNGGCTGCCTTCGCCTTCGGTATTCCTCCACATCTCTACGCNTTTCACCGCTACACGTGGAATTCTACCCCTCCCTAAAGTACTCTAGATTCCCAGTCTGAAATGCAATTCCCAGGTTAAGCCCGGGGCTTTCACACCTCACTTAAAAATCCGCCTGCGTGCCCTTTACGCCCAGTTATTCCGATTAACGCT
+
#8ACCGGGGGGGFFGGGGFGGGGGGGGGGGGGGGGGGGGGGGEGGFFGGGGGGGGGGGGGGFGFFGG<E#:BFFGGGFGGGGGCGGFEFGFFGGGGG<CFFCFGGGGGG#99@FFGGEGBGGFGGF8CFFFEFGGG<=9DC>DDGGD?C=,;EGFGBFDGFFGGGGCC;@EEFGGGGFGGGGGGGGGFGGFDEGGGAADE5;EEB*07/9<FFCFGFGD@=@EDFF>7>9;C?E</2(2.6;<=)7,9<?(*)6(7,,(-
@M03213:59:000000000-AWR6D:1:1101:9817:1174 2:N:0:NCCTGAGC+CTATTAAG
NGACTACTGGGGTTTCTAATCCTGTTCGCTACCCACGCTTTCGAGCCTCAGCGTCAGTTACAAGCCAGAGAGCCGCTTTCGCCACAGGTGTTCCTCCATATATCTACGCATTTCACCGCTACACATGGAATTCCACTCTCCCCTCTTGCACTCAAGTTAAACAGTTTCCAAAGCAAACTATGGTTGAGCCACAGCCTTTGACTTCAGACTTATCTAACCGCCTGCGCTCGCTTTCCGCCCACTAAATCCGTATAACTCTCG
+
#8ACCGGGGGGGFCGGGGGGGGGGGGFGGGGGGGGGGGGGGGGFGGGGFGFFGGGFGFGGGGGGG7CFGCFFGGGGBEGGGGGGGG?EGGGGGFGGGGGGGGFGGGGGGGE@DFAFGGGFGEGGGGGGGGGF<,@,DDFGGDGDG=EFGGGFF,=D?8DF,?EGGCFCF,DFFGGFCDGFGG8@E3<?FF8DG8BFDFEEFCBEFA7;@6;A@CGGC7915>8)702/8:4*4A+7=;((/(,/6<29(((*.//,,/)/(
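
Each record in a FASTQ file is four lines: a header (instrument, run, and tile coordinates, plus the index reads), the base calls, a `+` separator, and one quality character per base. Qualities are Phred scores encoded as ASCII with an offset of 33, which we can decode in R:

```r
# Decode Phred+33 quality characters into numeric quality scores
phred <- function(qual) utf8ToInt(qual) - 33
phred("#8AC")  # first four characters of the quality line above -> 2 23 32 34
```

Note that the leading `N` base call in each read above pairs with a `#` (Q2), which is the sequencer’s way of flagging a no-call.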

3 Find and Organize our Data

Let’s dive into some data and generate some ASVs! First things first: we need to organize our sequencing data. If you didn’t bring your own data, you will find the files that we work with today under raw_files, and we will use this location throughout the tutorial today. If you did bring your own data, figure out where they live and put that location here:

data_location <- "~/Documents/training/16S-data-processing/raw_files/"

The data we’re working with today come from a small subset of publicly available samples (Accession PRJEB42394) analyzed here[3], and consist of 16S rRNA V4 sequencing of nasal microbiome samples from children with asthma. Samples were sequenced on the NextSeq 500.

I have another markdown file with a tutorial for how I found and downloaded these files if you’re interested (See DownloadingFromSRA.html). For now, I have selected a handful of samples from this study that we will use to create an ASV/OTU table with DADA2 and those files reside in the raw_files directory we linked to above.

library(dada2)
Loading required package: Rcpp
read_indicator <- "_1"
all_fastqs <- list.files(data_location, full.names = T)
all_fastqs[1:15]
 [1] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083021_1.fastq.gz"
 [2] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083021_2.fastq.gz"
 [3] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083022_1.fastq.gz"
 [4] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083022_2.fastq.gz"
 [5] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083023_1.fastq.gz"
 [6] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083023_2.fastq.gz"
 [7] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083024_1.fastq.gz"
 [8] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083024_2.fastq.gz"
 [9] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083025_1.fastq.gz"
[10] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083025_2.fastq.gz"
[11] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083026_1.fastq.gz"
[12] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083026_2.fastq.gz"
[13] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083027_1.fastq.gz"
[14] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083027_2.fastq.gz"
[15] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083028_1.fastq.gz"
r1_fastqs <- all_fastqs[grepl(read_indicator, all_fastqs)]
r2_fastqs <- all_fastqs[!grepl(read_indicator, all_fastqs)]
r1_fastqs[1:10]
 [1] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083021_1.fastq.gz"
 [2] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083022_1.fastq.gz"
 [3] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083023_1.fastq.gz"
 [4] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083024_1.fastq.gz"
 [5] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083025_1.fastq.gz"
 [6] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083026_1.fastq.gz"
 [7] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083027_1.fastq.gz"
 [8] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083028_1.fastq.gz"
 [9] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083029_1.fastq.gz"
[10] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083030_1.fastq.gz"
r2_fastqs[1:10]
 [1] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083021_2.fastq.gz"
 [2] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083022_2.fastq.gz"
 [3] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083023_2.fastq.gz"
 [4] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083024_2.fastq.gz"
 [5] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083025_2.fastq.gz"
 [6] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083026_2.fastq.gz"
 [7] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083027_2.fastq.gz"
 [8] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083028_2.fastq.gz"
 [9] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083029_2.fastq.gz"
[10] "/Users/mccauleyke/Documents/training/16S-data-processing/raw_files//ERR5083030_2.fastq.gz"
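
It can also be handy to carry sample IDs forward for labeling later steps. One way, assuming the `<accession>_1.fastq.gz` naming used here, is:

```r
# Extract sample IDs from the forward-read file names
sample_ids <- gsub("_1\\.fastq\\.gz$", "", basename(r1_fastqs))
sample_ids[1:3]  # "ERR5083021" "ERR5083022" "ERR5083023"
```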

4 Identifying Primers and Adapters

Primers and adapters are typically present in sequence data as artifacts of the sequencing process, and need to be removed before we start processing our data. For the purposes of our discussion, I’m going to use R to ask if there are any primers in our sequencing data. This dataset examines the V4 region of the 16S rRNA gene, and uses 515F and 806R primers. You can get the primer sequences from the Earth Microbiome Project’s website (https://earthmicrobiome.org/protocols-and-standards/16s/). You may need to obtain different sequences for your primer strategy, but the overall code to search for our (known) primers remains the same.

The important thing to note about the code is that we’re looking for primer sequences at the beginning of the sequence read.

library(ShortRead)
library(Biostrings)
FWD <- "GTGCCAGCMGCCGCGGTAA"
REV <- "GGACTACHVGGGTWTCTAAT"

primerHits <- function(primer, fn) {
  ## Look only at the start of each read, where a primer would sit
  start_seq <- sread(readFastq(fn))
  start_seq <- subseq(start_seq, start = 1, end = nchar(primer))
  ## fixed = FALSE lets IUPAC ambiguity codes in the primer (M, H, V, W) match
  nhits <- vcountPattern(primer, start_seq, fixed = FALSE)
  return(sum(nhits > 0))
}
sapply(r1_fastqs[1:10], primerHits, primer = FWD)
sapply(r2_fastqs[1:10], primerHits, primer = REV)
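
If primers had turned up at the read starts, one common fix in R (rather than an external trimmer like cutadapt) is to strip them by length with `filterAndTrim`'s `trimLeft` argument during the filtering step. A sketch, assuming the 515F/806R primer lengths and the filtered output paths we define in the next section:

```r
# Sketch: strip fixed-length primers from the 5' ends while filtering
# (r1_filt/r2_filt are the filtered output paths defined below)
out <- filterAndTrim(r1_fastqs, r1_filt, r2_fastqs, r2_filt,
                     trimLeft = c(nchar(FWD), nchar(REV)))
```

This only works when the primer sits at a fixed position at the very start of every read; variable-position primers need a dedicated trimming tool.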

For adapters, you can search with tools like FASTQC[4], which is used in Nephele’s Pre-Process/Quality Check pipeline (https://nephele.niaid.nih.gov/details_qc/). This provides a good first-pass look at the data in the early stages, and I’ll walk through a demo.

Most often, adapters are found at the end of the read because of “read-through”. This happens when the read length is longer than the actual size of the region of interest. For instance, if you were sequencing the V4 region, which is 253 bp, on a 600-cycle MiSeq (which produces 300-base forward and reverse reads), you would read through the end of the amplicon into the adapter.

However, you can also find adapters because of unpredictable sequencing artifacts, so it’s always good to check for them.

Since the sequencing methods used for these samples resulted in reads that are shorter than 253 bases and the FASTQC analysis didn’t identify any adapters, we can move forward!

5 Quality Filtering and Trimming

Before we start filtering our data, let’s use the dada2 package to tell us a little more about the quality profiles of our samples; similar information can also be found in the FASTQC results.

plotQualityProfile(r1_fastqs[1:10])

plotQualityProfile(r2_fastqs[1:10])

Given this information, we will use the filterAndTrim function to clean our data. Before we start, we need somewhere for our filtered reads to go, so we will make those paths here. Keep in mind that I’m performing very simple string modification, but you may need to do something more complex for your data and samples.

r1_filt <- gsub("raw_files", "filtered", r1_fastqs)
r2_filt <- gsub("raw_files", "filtered", r2_fastqs)
## In my case, the filtered directory could exist before I make the document -- this `unlink` function just makes sure it's not there when I run filterAndTrim
unlink("filtered/", recursive = T)
out <- filterAndTrim(r1_fastqs, r1_filt, r2_fastqs, r2_filt, truncLen=0, maxN=0, maxEE=2, multithread = F, verbose=T, rm.lowcomplex = 4, rm.phix = T)
Creating output directory: /Users/mccauleyke/Documents/training/16S-data-processing/filtered
Read in 184105 paired-sequences, output 158051 (85.8%) filtered paired-sequences.
Read in 215283 paired-sequences, output 178520 (82.9%) filtered paired-sequences.
Read in 188345 paired-sequences, output 162151 (86.1%) filtered paired-sequences.
Read in 228630 paired-sequences, output 198817 (87%) filtered paired-sequences.
Read in 208557 paired-sequences, output 172454 (82.7%) filtered paired-sequences.
Read in 221399 paired-sequences, output 173492 (78.4%) filtered paired-sequences.
Read in 206371 paired-sequences, output 169799 (82.3%) filtered paired-sequences.
Read in 258864 paired-sequences, output 222692 (86%) filtered paired-sequences.
Read in 248457 paired-sequences, output 204051 (82.1%) filtered paired-sequences.
Read in 232949 paired-sequences, output 199131 (85.5%) filtered paired-sequences.
Read in 192017 paired-sequences, output 156998 (81.8%) filtered paired-sequences.
Read in 182086 paired-sequences, output 147476 (81%) filtered paired-sequences.
Read in 268647 paired-sequences, output 228523 (85.1%) filtered paired-sequences.
Read in 237626 paired-sequences, output 188749 (79.4%) filtered paired-sequences.
Read in 329348 paired-sequences, output 268618 (81.6%) filtered paired-sequences.
Read in 289866 paired-sequences, output 248927 (85.9%) filtered paired-sequences.
Read in 259349 paired-sequences, output 217032 (83.7%) filtered paired-sequences.
Read in 283546 paired-sequences, output 245519 (86.6%) filtered paired-sequences.
Read in 255232 paired-sequences, output 215983 (84.6%) filtered paired-sequences.
Read in 318604 paired-sequences, output 262542 (82.4%) filtered paired-sequences.
out
                      reads.in reads.out
ERR5083021_1.fastq.gz   184105    158051
ERR5083022_1.fastq.gz   215283    178520
ERR5083023_1.fastq.gz   188345    162151
ERR5083024_1.fastq.gz   228630    198817
ERR5083025_1.fastq.gz   208557    172454
ERR5083026_1.fastq.gz   221399    173492
ERR5083027_1.fastq.gz   206371    169799
ERR5083028_1.fastq.gz   258864    222692
ERR5083029_1.fastq.gz   248457    204051
ERR5083030_1.fastq.gz   232949    199131
ERR5083031_1.fastq.gz   192017    156998
ERR5083032_1.fastq.gz   182086    147476
ERR5083033_1.fastq.gz   268647    228523
ERR5083034_1.fastq.gz   237626    188749
ERR5083035_1.fastq.gz   329348    268618
ERR5083036_1.fastq.gz   289866    248927
ERR5083037_1.fastq.gz   259349    217032
ERR5083038_1.fastq.gz   283546    245519
ERR5083039_1.fastq.gz   255232    215983
ERR5083040_1.fastq.gz   318604    262542

Throughout the tutorial, we will review and experiment with some of the options available during this filtering and trimming step.
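
As one example of that kind of experimentation, you can rerun `filterAndTrim` on a single sample with different `maxEE` values and compare read retention. This is a sketch; the output paths here are throwaway temp files so the real `filtered` directory is untouched.

```r
# Compare read retention under different expected-error thresholds
for (ee in c(1, 2, 5)) {
  res <- filterAndTrim(r1_fastqs[1], tempfile(fileext = "_1.fastq.gz"),
                       r2_fastqs[1], tempfile(fileext = "_2.fastq.gz"),
                       maxN = 0, maxEE = ee)
  cat("maxEE =", ee, ": retained",
      round(100 * res[1, "reads.out"] / res[1, "reads.in"], 1), "%\n")
}
```

Stricter (smaller) `maxEE` values discard more reads but leave cleaner input for the error model.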

6 Denoise Reads

To actually get to “sequence variants”, the DADA2 method builds an “error” model of transition probabilities (which we plot below) and then computes an “abundance p-value” for each unique sequence. The null hypothesis is that the sequence was generated by sequencing errors from a more abundant sequence; a very small p-value means the sequence is too abundant to be explained by error alone, so it is kept as a real variant. We don’t end up seeing these p-values, but we can delve into the objects if we’re ever curious.

Schematic of the DADA Algorithm

If you want to learn more about the method, you can read either the DADA[5] or the DADA2[6] paper.

r1_err <- learnErrors(r1_filt, multithread=F, verbose=T)
106718846 total bases in 697539 reads from 4 samples will be used for learning the error rates.
Initializing error rates to maximum possible estimate.
selfConsist step 1 ....
   selfConsist step 2
   selfConsist step 3
   selfConsist step 4
   selfConsist step 5
Convergence after  5  rounds.
r2_err <- learnErrors(r2_filt, multithread=F, verbose=T)
106708240 total bases in 697539 reads from 4 samples will be used for learning the error rates.
Initializing error rates to maximum possible estimate.
selfConsist step 1 ....
   selfConsist step 2
   selfConsist step 3
   selfConsist step 4
   selfConsist step 5
   selfConsist step 6
   selfConsist step 7
Convergence after  7  rounds.
plotErrors(r1_err, nominalQ = T)

plotErrors(r2_err, nominalQ = T)

r1_dada <- dada(r1_filt, r1_err)
Sample 1 - 158051 reads in 21447 unique sequences.
Sample 2 - 178520 reads in 25646 unique sequences.
Sample 3 - 162151 reads in 23906 unique sequences.
Sample 4 - 198817 reads in 24321 unique sequences.
Sample 5 - 172454 reads in 30319 unique sequences.
Sample 6 - 173492 reads in 25660 unique sequences.
Sample 7 - 169799 reads in 23852 unique sequences.
Sample 8 - 222692 reads in 28078 unique sequences.
Sample 9 - 204051 reads in 28810 unique sequences.
Sample 10 - 199131 reads in 29018 unique sequences.
Sample 11 - 156998 reads in 26567 unique sequences.
Sample 12 - 147476 reads in 21121 unique sequences.
Sample 13 - 228523 reads in 28055 unique sequences.
Sample 14 - 188749 reads in 28698 unique sequences.
Sample 15 - 268618 reads in 24519 unique sequences.
Sample 16 - 248927 reads in 25518 unique sequences.
Sample 17 - 217032 reads in 31136 unique sequences.
Sample 18 - 245519 reads in 25746 unique sequences.
Sample 19 - 215983 reads in 34793 unique sequences.
Sample 20 - 262542 reads in 29949 unique sequences.
r2_dada <- dada(r2_filt, r2_err)
Sample 1 - 158051 reads in 66924 unique sequences.
Sample 2 - 178520 reads in 77892 unique sequences.
Sample 3 - 162151 reads in 72509 unique sequences.
Sample 4 - 198817 reads in 75899 unique sequences.
Sample 5 - 172454 reads in 81000 unique sequences.
Sample 6 - 173492 reads in 76405 unique sequences.
Sample 7 - 169799 reads in 74547 unique sequences.
Sample 8 - 222692 reads in 89503 unique sequences.
Sample 9 - 204051 reads in 90341 unique sequences.
Sample 10 - 199131 reads in 88006 unique sequences.
Sample 11 - 156998 reads in 74440 unique sequences.
Sample 12 - 147476 reads in 65312 unique sequences.
Sample 13 - 228523 reads in 93916 unique sequences.
Sample 14 - 188749 reads in 91301 unique sequences.
Sample 15 - 268618 reads in 98481 unique sequences.
Sample 16 - 248927 reads in 93817 unique sequences.
Sample 17 - 217032 reads in 97268 unique sequences.
Sample 18 - 245519 reads in 90577 unique sequences.
Sample 19 - 215983 reads in 101768 unique sequences.
Sample 20 - 262542 reads in 102191 unique sequences.
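
The `dada` function returns a list with one `dada-class` object per sample, and you can inspect any element to see how many sequence variants were inferred and at what abundances (illustrative; exact numbers will differ):

```r
r1_dada[[1]]                       # prints a summary of the denoising result
head(getUniques(r1_dada[[1]]), 3)  # named vector: variant sequence -> read count
```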

7 Merge Reads

Now that we’ve corrected our reads based on this information, we can merge our forward and reverse reads. Here we need to keep our sequencing structure in mind: for today’s data, the region of interest is 253 bases, covered by 153 bases on the forward read and 153 bases on the reverse read. Since 306 - 253 = 53, we can’t expect an overlap greater than ~53 bases, and that informs how we set the merging parameters. Here is a schematic that might help:

Read Merging Schematic

merged_reads <- mergePairs(r1_dada, r1_filt, r2_dada, r2_filt, verbose=TRUE, minOverlap = 20)
118559 paired-reads (in 263 unique pairings) successfully merged out of 157061 (in 844 pairings) input.
136084 paired-reads (in 168 unique pairings) successfully merged out of 177710 (in 442 pairings) input.
121523 paired-reads (in 137 unique pairings) successfully merged out of 160928 (in 506 pairings) input.
147647 paired-reads (in 242 unique pairings) successfully merged out of 197562 (in 774 pairings) input.
136283 paired-reads (in 174 unique pairings) successfully merged out of 171547 (in 602 pairings) input.
107191 paired-reads (in 174 unique pairings) successfully merged out of 172300 (in 1282 pairings) input.
97053 paired-reads (in 79 unique pairings) successfully merged out of 169115 (in 291 pairings) input.
166082 paired-reads (in 151 unique pairings) successfully merged out of 221983 (in 393 pairings) input.
145218 paired-reads (in 257 unique pairings) successfully merged out of 202899 (in 923 pairings) input.
140775 paired-reads (in 330 unique pairings) successfully merged out of 197555 (in 1057 pairings) input.
114587 paired-reads (in 197 unique pairings) successfully merged out of 155831 (in 662 pairings) input.
113570 paired-reads (in 154 unique pairings) successfully merged out of 146691 (in 425 pairings) input.
166142 paired-reads (in 296 unique pairings) successfully merged out of 227446 (in 925 pairings) input.
132084 paired-reads (in 183 unique pairings) successfully merged out of 187609 (in 771 pairings) input.
197863 paired-reads (in 53 unique pairings) successfully merged out of 268009 (in 157 pairings) input.
177417 paired-reads (in 310 unique pairings) successfully merged out of 247719 (in 1030 pairings) input.
150969 paired-reads (in 359 unique pairings) successfully merged out of 215815 (in 1183 pairings) input.
175735 paired-reads (in 451 unique pairings) successfully merged out of 244468 (in 1416 pairings) input.
162261 paired-reads (in 803 unique pairings) successfully merged out of 214277 (in 2696 pairings) input.
189529 paired-reads (in 153 unique pairings) successfully merged out of 261759 (in 539 pairings) input.
head(merged_reads[[1]])
                                                                                                                                                                                                                                                       sequence
1 TACGTAGGTGACAAGCGTTGTCCGGATTTATTGGGCGTAAAGGGAGCGCAGGCGGTCTGTTTAGTCTAATGTGAAAGCCCACGGCTTAACCGTGGAACGGCATTGGAAACTGACAGACTTGAATGTAGAAGAGGAAAATGGAATTCCAAGTGTAGCGGTGGAATGCGTAGATATTTGGAGGAACACCAGTGGCGAAGGCGATTTTCTGGTCTAACATTGACGCTGAGGCTCGAAAGCGTGGGGAGCGAACAGG
2 TACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTATTTAAGTCAGATGTGAAAGCCCCGGGCTTAACCTGGGAACTGCATCTGATACTGGATAACTAGAGTAGGTGAGAGGGGAGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCTCCCTGGCATCATACTGACACTGAGGTGCGAAAGCGTGGGTAGCAAACAGG
3 TACGTAGGTGACAAGCGTTGTCCGGATTTATTGGGCGTAAAGGGAGCGCAGGCGGTCTGTTTAGTCTAATGTGAAAGCCCACGGCTTAACCGTGGAACGGCATTGGAAACTGACAGACTTGAATGTAGAAGAGGAAAATGGAATTCCAAGTGTAGCGGGGGAATGCGTAGATATTTGGAGGAACACCAGTGGCGAAGGCGATTTTCTGGTCTAACATTGACGCTGAGGCTCGAAAGCGTGGGGAGCGAACAGG
6 TACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTGTGCAAGTCTGATGTGAAAGCCCCGGGCTTAACCTGGGAACGGCATTGGAGACTGCACGACTAGAGTGCGTCAGAGGGGGGTAGAATTCCGCGTGTAGCAGTGAAATGCGTAGAGATGCGGAGGAATACCGATGGCGAAGGCAGCCCCCTGGGATGACACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGG
7 TACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTGTTAAGTTGGATGTGAAAGCCCCGGGCTCAACCTGGGAACTGCATTCAAAACTGACAAGCTAGAGTATGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACAGG
8 TACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCAAACAGG
  abundance forward reverse nmatch nmismatch nindel prefer accept
1     50707       1       1     53         0      0      1   TRUE
2     10202       2       2     53         0      0      1   TRUE
3      8937       1       3     53         0      0      1   TRUE
6      6168       3       4     53         0      0      1   TRUE
7      5230       4       7     53         0      0      1   TRUE
8      4481       5       5     53         0      0      1   TRUE

8 Make Sequence Table

And finally we can make the first draft of our table!

seq_table <- makeSequenceTable(merged_reads)
dim(seq_table)
[1]   20 2619

From this information, we can see that the table has 20 rows, one for each of our 20 samples, and 2619 columns, one for each sequence variant.

We can use basic R functions to explore the dataset, much like we would any table of counts.

Below is the distribution of the sequence lengths for our data. Notice that a large majority of sequences are 253 bases long, which is how long the V4 region is.

table(nchar(getSequences(seq_table)))

 164  165  168  169  199  202  207  209  220  223  227  230  250  251  252  253 
   3    3    3    3    1    1    1    1    1    1    2    2    4    3  295 2178 
 254  255  266  268  276  279  280  285 
 107    2    1    1    3    1    1    1 
Filter for sequence length!

If you look at the table and find sequences that are much longer or shorter than your target amplicon, you can remove them with basic R commands like: filt_seq_table <- seq_table[, nchar(getSequences(seq_table)) %in% 250:255]

These abnormal lengths can arise from non-specific priming and, when rare, are typically not cause for concern.

9 Remove Chimeras

Chimeras are hybrids of two “parent” sequences, meaning that you have one bacterial sequence belonging to, say, E. coli, on one side of the read and another bacterial sequence belonging to Streptococcus on the other. These are typically due to the PCR process, and are considered artifacts that should be filtered out. We do that with the removeBimeraDenovo function, which checks lower abundance sequence variants against higher abundance sequence variants to determine if any match to two separate “parents”.

chimera_check <- removeBimeraDenovo(seq_table, verbose=T)
Identified 1318 bimeras out of 2619 input sequences.
dim(chimera_check)
[1]   20 1301

While we lose about half of our sequence variants to chimeras, we can determine how many reads were retained with this quick one-liner:

sum(chimera_check)/sum(seq_table)
[1] 0.9791585
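
If you’d rather inspect what was flagged before dropping anything, the underlying check can be run directly; a sketch, where `isBimeraDenovoTable` is the consensus method that `removeBimeraDenovo` applies to sequence tables:

```r
# Flag chimeric variants without removing them yet
bim <- isBimeraDenovoTable(seq_table)
table(bim)                              # TRUE = flagged as a bimera
head(getSequences(seq_table)[bim], 2)   # look at a couple of flagged sequences
```
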
Workflow Checkpoint!

Let’s take a look at how our data looks so far in a nice, convenient table!

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(r1_dada, getN), sapply(r2_dada, getN), sapply(merged_reads, getN), rowSums(chimera_check))
colnames(track) <- c("input","filtered","denoised_r1","denoised_r2","merged","nonchimeric")
track
                       input filtered denoised_r1 denoised_r2 merged
ERR5083021_1.fastq.gz 184105   158051      157821      157226 118559
ERR5083022_1.fastq.gz 215283   178520      178456      177763 136084
ERR5083023_1.fastq.gz 188345   162151      161920      161148 121523
ERR5083024_1.fastq.gz 228630   198817      198556      197766 147647
ERR5083025_1.fastq.gz 208557   172454      172180      171785 136283
ERR5083026_1.fastq.gz 221399   173492      173030      172735 107191
ERR5083027_1.fastq.gz 206371   169799      169590      169319  97053
ERR5083028_1.fastq.gz 258864   222692      222567      222099 166082
ERR5083029_1.fastq.gz 248457   204051      203679      203177 145218
ERR5083030_1.fastq.gz 232949   199131      198526      198123 140775
ERR5083031_1.fastq.gz 192017   156998      156541      156255 114587
ERR5083032_1.fastq.gz 182086   147476      147254      146893 113570
ERR5083033_1.fastq.gz 268647   228523      228125      227835 166142
ERR5083034_1.fastq.gz 237626   188749      188396      187952 132084
ERR5083035_1.fastq.gz 329348   268618      268448      268153 197863
ERR5083036_1.fastq.gz 289866   248927      248724      247885 177417
ERR5083037_1.fastq.gz 259349   217032      216652      216165 150969
ERR5083038_1.fastq.gz 283546   245519      245195      244717 175735
ERR5083039_1.fastq.gz 255232   215983      215546      214663 162261
ERR5083040_1.fastq.gz 318604   262542      262414      261864 189529
                      nonchimeric
ERR5083021_1.fastq.gz      117604
ERR5083022_1.fastq.gz      135350
ERR5083023_1.fastq.gz      119800
ERR5083024_1.fastq.gz      146898
ERR5083025_1.fastq.gz      135089
ERR5083026_1.fastq.gz      106009
ERR5083027_1.fastq.gz       96420
ERR5083028_1.fastq.gz      165705
ERR5083029_1.fastq.gz      141256
ERR5083030_1.fastq.gz      137064
ERR5083031_1.fastq.gz      113687
ERR5083032_1.fastq.gz      112657
ERR5083033_1.fastq.gz      163770
ERR5083034_1.fastq.gz      130144
ERR5083035_1.fastq.gz      196944
ERR5083036_1.fastq.gz      168845
ERR5083037_1.fastq.gz      145564
ERR5083038_1.fastq.gz      165918
ERR5083039_1.fastq.gz      148588
ERR5083040_1.fastq.gz      188891

As you can see, most of the read loss for these samples occurred during filtering and merging. For now, nothing suggests we need to double-check our workflow, but if the data ends up more problematic, this table can help pinpoint which step to focus on.
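
One quick way to spot a problem step is to express every column of `track` as a fraction of the input reads:

```r
# Per-step retention as a fraction of input reads (each column / input column)
round(track / track[, "input"], 3)
```

A sharp drop in one column relative to its neighbors points at the step worth revisiting.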

10 Assign Taxonomy

We now have 1301 sequence variants, and we want to determine which bacteria they are assigned to. To do this, we need to download specially formatted databases (you can find several different databases here, some of which are contributed by outside teams). For our purposes, we’ll download the SILVA database[7]. You’ll need all three files, and we’ll place them in a folder called databases in our working directory.

taxa <- assignTaxonomy(chimera_check, "databases/silva_nr99_v138.1_train_set.fa.gz", multithread=F, verbose=T, minBoot = 80)
Finished processing reference fasta.
taxa_print <- taxa
rownames(taxa_print) <- NULL
head(taxa_print,10)
      Kingdom    Phylum             Class                 Order              
 [1,] "Bacteria" "Firmicutes"       "Bacilli"             "Lactobacillales"  
 [2,] "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Pseudomonadales"  
 [3,] "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Burkholderiales"  
 [4,] "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Pseudomonadales"  
 [5,] "Bacteria" "Firmicutes"       "Bacilli"             "Staphylococcales" 
 [6,] "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Enterobacterales" 
 [7,] "Bacteria" "Actinobacteriota" "Actinobacteria"      "Corynebacteriales"
 [8,] "Bacteria" "Firmicutes"       "Bacilli"             "Lactobacillales"  
 [9,] "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Pseudomonadales"  
[10,] "Bacteria" "Firmicutes"       "Bacilli"             "Lactobacillales"  
      Family               Genus            
 [1,] "Carnobacteriaceae"  "Dolosigranulum" 
 [2,] "Moraxellaceae"      "Moraxella"      
 [3,] "Chromobacteriaceae" "Chromobacterium"
 [4,] "Pseudomonadaceae"   "Pseudomonas"    
 [5,] "Staphylococcaceae"  "Staphylococcus" 
 [6,] "Pasteurellaceae"    "Haemophilus"    
 [7,] "Corynebacteriaceae" "Corynebacterium"
 [8,] "Carnobacteriaceae"  "Dolosigranulum" 
 [9,] "Moraxellaceae"      "Moraxella"      
[10,] "Streptococcaceae"   "Streptococcus"  

For today, we will only assign taxonomy to the genus level. Species-level assignment uses a separate function (assignSpecies) and is considered fully valid for DADA2’s exact sequence variants. The method attempts to identify the genus and species independently, and only assigns a species when the genus calls match.
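
For reference, species assignment would look roughly like this (the species training set is a separate download from the same databases page; the file name below is assumed):

```r
# Sketch: exact species-level matching against a species training set
species <- assignSpecies(getSequences(chimera_check),
                         "databases/silva_species_assignment_v138.1.fa.gz")
head(species)  # columns: Genus, Species (NA where no unambiguous exact match)
```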

11 Make Phylogenetic Tree

Let’s arrange our unique sequences into a phylogenetic tree so that we know how similar they are to each other. This tree can be used for statistical analyses or visualizations later. As a disclaimer, there are a few additional steps needed to “optimize” the fit of the tree, but we won’t do them here because it can take a long time. If you would like to explore this method of tree building further, you can find additional documentation here.

library(DECIPHER)
Warning: package 'IRanges' was built under R version 4.3.1
Warning: package 'GenomeInfoDb' was built under R version 4.3.1
library(phangorn)
seqs <- as.character(getSequences(chimera_check))
names(seqs) <- seqs
alignment <- DECIPHER::AlignSeqs(DNAStringSet(seqs), anchor=NA)
Determining distance matrix based on shared 8-mers: 9.86 secs
Clustering into groups by similarity: 0.19 secs
Aligning Sequences: 3.19 secs
Iteration 1 of 2:
  Determining distance matrix based on alignment: 0.75 secs
  Reclustering into groups by similarity: 0.15 secs
  Realigning Sequences: 1.79 secs
Iteration 2 of 2:
  Determining distance matrix based on alignment: 0.7 secs
  Reclustering into groups by similarity: 0.15 secs
  Realigning Sequences: 0.44 secs
Refining the alignment: 0.13 secs
phang.align <- as.phyDat(as(alignment, "matrix"), type="DNA")  # convert the alignment to phangorn's format
dm <- dist.ml(phang.align)  # maximum-likelihood distance matrix
treeNJ <- NJ(dm)  # neighbor-joining starting tree
fit <- pml(treeNJ, data=phang.align)  # compute the likelihood of the tree
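For reference, the slow optimization step we skipped typically refits the neighbor-joining tree by maximum likelihood under a GTR model in phangorn. This is a hedged sketch: the starting parameter values and rearrangement strategy here are illustrative choices, not required settings.

```r
# Refit the starting tree under a GTR model with gamma-distributed rate
# variation (k = 4 categories) and a proportion of invariant sites.
# This is the time-consuming step omitted from the tutorial.
fitGTR <- update(fit, k = 4, inv = 0.2)
fitGTR <- optim.pml(fitGTR, model = "GTR",
                    optInv = TRUE, optGamma = TRUE,
                    rearrangement = "stochastic",
                    control = pml.control(trace = 0))
# fitGTR$tree then holds the optimized tree
```

If you run this, you can use `fitGTR$tree` in place of `fit$tree` when building the phyloseq object.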

12 Generate a Phyloseq Object

Now we have the components we need to make an object that is compatible with various microbiome packages. My personal preference is phyloseq, but there are other packages that do the same or similar things, including ampvis2 and SummarizedExperiment. I don’t currently have sample metadata for these samples, so we will make some dummy data first.

library(phyloseq)

Attaching package: 'phyloseq'
The following object is masked from 'package:IRanges':

    distance
# Randomly assign each of the 20 samples to a dummy treatment group
dummy_sample_data <- data.frame(
  samples = rownames(chimera_check),
  treatment_group = sample(c("A", "B"), size = 20, replace = TRUE)
)
rownames(dummy_sample_data) <- dummy_sample_data$samples

phy_obj <- phyloseq(
  otu_table(chimera_check, taxa_are_rows = FALSE),
  sample_data(dummy_sample_data),
  phy_tree(fit$tree),
  tax_table(taxa)
)
phy_obj
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1301 taxa and 20 samples ]
sample_data() Sample Data:       [ 20 samples by 2 sample variables ]
tax_table()   Taxonomy Table:    [ 1301 taxa by 6 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 1301 tips and 1299 internal nodes ]

With this new phyloseq object, we can create quick and informative visualizations. For a simple start, we will create a bar plot and a couple of PCoA plots.

library(ggplot2)
plot_bar(phy_obj, fill="Genus") + guides(fill="none")

ord <- ordinate(phy_obj, method="PCoA", distance = "bray")
plot_ordination(phy_obj, ord)

ord <- ordinate(phy_obj, method="PCoA", distance = "wunifrac")
Warning in UniFrac(physeq, weighted = TRUE, ...): Randomly assigning root as --
TACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTGTTAAGCGAGATGTGAAAGCCCTGGGCTCAACCTAGGAATAGCATTTCGAACTGGCGAACTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGG
-- in the phylogenetic tree in the data you provided.
plot_ordination(phy_obj, ord)
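The warning above appears because our tree is unrooted, so UniFrac assigns a root at random, which can make weighted UniFrac results slightly non-deterministic between runs. One common remedy (not part of the workflow above, but a standard approach) is to midpoint-root the tree with phangorn before ordinating:

```r
# Midpoint-root the tree so UniFrac doesn't have to pick a random root
phy_tree(phy_obj) <- phangorn::midpoint(phy_tree(phy_obj))

# Re-run the weighted UniFrac ordination; the warning should no longer appear
ord <- ordinate(phy_obj, method = "PCoA", distance = "wunifrac")
plot_ordination(phy_obj, ord)
```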

Footnotes

  1. https://nephele.niaid.nih.gov/↩︎

  2. Callahan, B., McMurdie, P., Rosen, M. et al. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods 13, 581–583 (2016). https://doi.org/10.1038/nmeth.3869↩︎

  3. McCauley KE, Flynn K, Calatroni A, DiMassa V, LaMere B, Fadrosh DW, Lynch KV, Gill MA, Pongracic JA, Khurana Hershey GK, Kercsmar CM, Liu AH, Johnson CC, Kim H, Kattan M, O’Connor GT, Bacharier LB, Teach SJ, Gergen PJ, Wheatley LM, Togias A, LeBeau P, Presnell S, Boushey HA, Busse WW, Gern JE, Jackson DJ, Altman MC, Lynch SV; National Institute of Allergy and Infectious Diseases–sponsored Inner-City Asthma Consortium. Seasonal airway microbiome and transcriptome interactions promote childhood asthma exacerbations. J Allergy Clin Immunol. 2022 Jul;150(1):204-213. doi: 10.1016/j.jaci.2022.01.020. Epub 2022 Feb 8. PMID: 35149044.↩︎

  4. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/↩︎

  5. Rosen MJ, Callahan BJ, Fisher DS, Holmes SP. Denoising PCR-amplified metagenome data. BMC Bioinformatics. 2012 Oct 31;13:283. doi: 10.1186/1471-2105-13-283. PMID: 23113967; PMCID: PMC3563472.↩︎

  6. Callahan BJ, Wong J, Heiner C, Oh S, Theriot CM, Gulati AS, McGill SK, Dougherty MK. High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution. Nucleic Acids Res. 2019 Oct 10;47(18):e103. doi: 10.1093/nar/gkz569. PMID: 31269198; PMCID: PMC6765137.↩︎

  7. Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO (2013) The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucl. Acids Res. 41 (D1): D590-D596.↩︎